
feat: add chunk application stats #12797

Open. Wants to merge 17 commits into base: master.
Conversation

@jancionear (Contributor) commented Jan 24, 2025

This is the first step towards per-chunk metrics (#12758).

This PR adds a new struct, ChunkApplyStats, which records what happened during chunk
application: how many transactions and receipts there were, what the outgoing limits were, how
many receipts were forwarded or buffered, and so on.

For now ChunkApplyStats contains mainly data relevant to the bandwidth scheduler; in the future
more stats can be added to measure other things we're interested in. I didn't want to add too
much at once, to keep the PR size reasonable.

There was already a struct called ApplyStats, but it was used only by the balance checker. I
replaced it with BalanceStats inside ChunkApplyStats.

ChunkApplyStats are returned in ApplyChunkResult and saved to the database for later use. A new
database column is added to keep the chunk application stats. The column is included in the standard
garbage collection logic to keep the size of saved data reasonable.

Running neard view-state chunk-apply-stats allows a node operator to view chunk application stats
for a given chunk. Example output for a mainnet chunk:

$ ./neard view-state chunk-apply-stats --block-hash GKzyP7DVNw5ctUcBhRRkABMaC2giNSKK5oHCrRc9hnXH --shard-id 0
...
V0(
    ChunkApplyStatsV0 {
        height: 138121896,
        shard_id: 0,
        is_chunk_missing: false,
        transactions_num: 35,
        incoming_receipts_num: 103,
        receipt_sink: ReceiptSinkStats {
            outgoing_limits: {
                0: OutgoingLimitStats {
                    size: 102400,
                    gas: 18446744073709551615,
                },
                1: OutgoingLimitStats {
                    size: 4718592,
                    gas: 300000000000000000,
                },
                2: OutgoingLimitStats {
                    size: 102400,
                    gas: 300000000000000000,
                },
                3: OutgoingLimitStats {
                    size: 102400,
                    gas: 300000000000000000,
                },
                4: OutgoingLimitStats {
                    size: 102400,
                    gas: 300000000000000000,
                },
                5: OutgoingLimitStats {
                    size: 102400,
                    gas: 300000000000000000,
                },
            },
            forwarded_receipts: {
                0: ReceiptsStats {
                    num: 24,
                    total_size: 6801,
                    total_gas: 515985143008901,
                },
                2: ReceiptsStats {
                    num: 21,
                    total_size: 6962,
                    total_gas: 639171080456467,
                },
                3: ReceiptsStats {
                    num: 58,
                    total_size: 17843,
                    total_gas: 1213382619794847,
                },
                4: ReceiptsStats {
                    num: 20,
                    total_size: 6278,
                    total_gas: 235098003759589,
                },
                5: ReceiptsStats {
                    num: 4,
                    total_size: 2089,
                    total_gas: 245101556851946,
                },
            },
            buffered_receipts: {},
            final_outgoing_buffers: {
                0: ReceiptsStats {
                    num: 0,
                    total_size: 0,
                    total_gas: 0,
                },
                2: ReceiptsStats {
                    num: 0,
                    total_size: 0,
                    total_gas: 0,
                },
                3: ReceiptsStats {
                    num: 0,
                    total_size: 0,
                    total_gas: 0,
                },
                4: ReceiptsStats {
                    num: 0,
                    total_size: 0,
                    total_gas: 0,
                },
                5: ReceiptsStats {
                    num: 0,
                    total_size: 0,
                    total_gas: 0,
                },
            },
            is_outgoing_metadata_ready: {
                0: false,
                2: false,
                3: false,
                4: false,
                5: false,
            },
            all_outgoing_metadatas_ready: false,
        },
        bandwidth_scheduler: BandwidthSchedulerStats {
            params: None,
            prev_bandwidth_requests: {},
            prev_bandwidth_requests_num: 0,
            time_to_run_ms: 0,
            granted_bandwidth: {},
            new_bandwidth_requests: {},
        },
        balance: BalanceStats {
            tx_burnt_amount: 4115983319195000000000,
            slashed_burnt_amount: 0,
            other_burnt_amount: 0,
            gas_deficit_amount: 0,
        },
    },
)

The stats are also available in ChainStore, making it easy to read them from tests.
In the future we could also add an RPC endpoint to make the stats available in debug-ui.

The PR is divided into commits for easier review.

@jancionear jancionear requested a review from wacban January 24, 2025 18:38
@jancionear jancionear requested a review from a team as a code owner January 24, 2025 18:38
@jancionear jancionear requested a review from mooori January 24, 2025 18:39
@jancionear (Contributor Author):
/cc @mooori @nagisa
We could add more stats to ChunkApplyStats to help analyze runtime performance: where gas and time are spent, which limits were hit, etc.

*block_hash,
shard_uid.shard_id(),
apply_result.stats,
);
@jancionear (Contributor Author) commented Jan 24, 2025:
Saving chunk stats here means that only chunks applied inside blocks will have their stats saved; stateless chunk validators will not save any stats. In the future we could save them somewhere else, but this is good enough for the first version.

@@ -462,7 +467,8 @@ impl DBCol {
| DBCol::StateHeaders
| DBCol::TransactionResultForBlock
| DBCol::Transactions
| DBCol::StateShardUIdMapping => true,
| DBCol::StateShardUIdMapping
| DBCol::ChunkApplyStats => true,
@jancionear (Contributor Author):

Is marking this column as cold enough to avoid garbage collection on archival nodes? I think these stats should be kept forever on archival nodes: they are not that big, and it would be nice to be able to view stats for chunks older than three epochs.

/// The stats can be read to analyze what happened during chunk application.
/// - *Rows*: BlockShardId (BlockHash || ShardId) - 40 bytes
/// - *Column type*: `ChunkApplyStats`
ChunkApplyStats,
@jancionear (Contributor Author):

At first I thought I could use ChunkHash as the database key, but that doesn't really work: the
same chunk can be applied multiple times when there are missing chunks, and chunks created from
the same prev_block would, I think, have the same hash (?).
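The row-key layout from the column description (BlockHash || ShardId, 40 bytes) can be sketched as below. The helper name and the little-endian shard-id encoding are assumptions for this sketch, not the actual nearcore implementation:

```rust
// Hypothetical sketch of the 40-byte BlockShardId row key:
// the 32-byte block hash followed by the 8-byte shard id.
fn block_shard_id_key(block_hash: &[u8; 32], shard_id: u64) -> Vec<u8> {
    let mut key = Vec::with_capacity(40);
    key.extend_from_slice(block_hash);
    // Endianness here is an assumption; check the real key-building helper.
    key.extend_from_slice(&shard_id.to_le_bytes());
    key
}

fn main() {
    let key = block_shard_id_key(&[0u8; 32], 3);
    // Same block hash but a different shard id yields a different key,
    // which is why this works where a plain ChunkHash key would collide.
    assert_ne!(key, block_shard_id_key(&[0u8; 32], 4));
    assert_eq!(key.len(), 40);
    println!("key length: {}", key.len());
}
```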

@@ -648,6 +648,7 @@ impl<'a> ChainStoreUpdate<'a> {
self.gc_outgoing_receipts(&block_hash, shard_id);
self.gc_col(DBCol::IncomingReceipts, &block_shard_id);
self.gc_col(DBCol::StateTransitionData, &block_shard_id);
self.gc_col(DBCol::ChunkApplyStats, &block_shard_id);
@jancionear (Contributor Author):

I wonder if we could use some other garbage collection logic to keep the stats for longer than three epochs. Maybe something similar to LatestWitnesses where the last N witnesses are kept in the database? It's annoying that useful data like these stats disappears after three epochs, especially in tests which have to run for a few epochs. Can be changed later.

Contributor:

Agreed that it would be cool to keep those longer and agreed to keep the first version simple.

@@ -336,7 +327,7 @@ impl Runtime {
apply_state: &ApplyState,
signed_transaction: &SignedTransaction,
transaction_cost: &TransactionCost,
stats: &mut ApplyStats,
stats: &mut ChunkApplyStatsV0,
) -> Result<(Receipt, ExecutionOutcomeWithId), InvalidTxError> {
let span = tracing::Span::current();
metrics::TRANSACTION_PROCESSED_TOTAL.inc();
@jancionear (Contributor Author):

Runtime metrics could probably be refactored so that we first collect the stats and then record
all of them in the metrics at the very end. That would reduce clutter in the runtime code.

codecov bot commented Jan 24, 2025

Codecov Report

Attention: Patch coverage is 72.32472% with 75 lines in your changes missing coverage. Please review.

Project coverage is 70.39%. Comparing base (8b861f8) to head (95e1c4a).

Files with missing lines Patch % Lines
tools/state-viewer/src/commands.rs 0.00% 17 Missing ⚠️
chain/chain/src/store/mod.rs 36.00% 16 Missing ⚠️
runtime/runtime/src/lib.rs 60.00% 7 Missing and 3 partials ⚠️
core/store/src/adapter/chain_store.rs 0.00% 9 Missing ⚠️
core/primitives/src/chunk_apply_stats.rs 92.39% 7 Missing ⚠️
runtime/runtime/src/congestion_control.rs 87.71% 7 Missing ⚠️
tools/state-viewer/src/cli.rs 0.00% 6 Missing ⚠️
.../runtime-params-estimator/src/estimator_context.rs 0.00% 2 Missing ⚠️
core/primitives/src/bandwidth_scheduler.rs 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #12797      +/-   ##
==========================================
- Coverage   70.40%   70.39%   -0.02%     
==========================================
  Files         851      852       +1     
  Lines      174188   174429     +241     
  Branches   174188   174429     +241     
==========================================
+ Hits       122634   122785     +151     
- Misses      46311    46398      +87     
- Partials     5243     5246       +3     
Flag Coverage Δ
backward-compatibility 0.16% <0.00%> (-0.01%) ⬇️
db-migration 0.16% <0.00%> (-0.01%) ⬇️
genesis-check 1.41% <0.00%> (-0.01%) ⬇️
linux 70.07% <72.32%> (+0.01%) ⬆️
linux-nightly 70.02% <72.32%> (-0.02%) ⬇️
pytests 1.70% <0.00%> (-0.01%) ⬇️
sanity-checks 1.52% <0.00%> (-0.01%) ⬇️
unittests 70.22% <72.32%> (-0.02%) ⬇️
upgradability 0.20% <0.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown.


@@ -0,0 +1,218 @@
use std::collections::BTreeMap;
Collaborator:

Does this need to be a part of primitives? Isn't there an obvious conceptual "producer" crate which all dependents use that could hold this type?

@jancionear (Contributor Author):

I initially put it in node-runtime, but then I needed the struct in near-store, which doesn't depend on node-runtime, so I moved it to primitives. It's a primitive struct used by multiple crates, so that seemed like a good fit.

In the future more crates might make use of these stats, maybe a custom aggregator which downloads stats from multiple nodes and aggregates them somehow. It would be nice to have a small crate that the aggregator can import without pulling in all of runtime.

If there's a better place for it please let me know.

/// Useful for debugging, metrics and sanity checks.
#[derive(Debug, Clone, BorshSerialize, BorshDeserialize)]
pub enum ChunkApplyStats {
V0(ChunkApplyStatsV0),
Collaborator:

Would it be possible to find a way to avoid versioning headaches with this mostly internal data? I don't think it would be painful to make the old data inaccessible when the schema changes; we should take advantage of that.

@jancionear (Contributor Author):

These stats might be consumed by other services in the future (debug-ui, custom stats aggregators, etc.), so I wanted a (mostly) stable interface that they could depend on. My first thought was to make it versioned, but maybe there are other ways to go about it.
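One way to keep the versioned enum while sparing callers from matching on versions is to expose accessors on the enum itself. A minimal sketch, with only a couple of illustrative fields taken from the example output above (the real structs carry many more):

```rust
// Simplified stand-in for the stats struct in this PR.
pub struct ChunkApplyStatsV0 {
    pub height: u64,
    pub transactions_num: u64,
}

// Versioned wrapper, as in the PR.
pub enum ChunkApplyStats {
    V0(ChunkApplyStatsV0),
}

impl ChunkApplyStats {
    // Callers go through accessors, so adding a hypothetical V1 later
    // only touches this impl, not every consumer.
    pub fn transactions_num(&self) -> u64 {
        match self {
            ChunkApplyStats::V0(v0) => v0.transactions_num,
        }
    }
}

fn main() {
    let stats = ChunkApplyStats::V0(ChunkApplyStatsV0 {
        height: 138121896,
        transactions_num: 35,
    });
    assert_eq!(stats.transactions_num(), 35);
    println!("{}", stats.transactions_num());
}
```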

@jancionear (Contributor Author) commented Jan 31, 2025:

It looks like the disk filling up was caused by an unrelated issue (a faulty rocksdb update) combined with insufficient memory on the node; constant crashes and rocksdb acting up caused too much data to be written to disk.
Merging the latest master, where the rocksdb problem was fixed, and running the node with an appropriate amount of memory (64 GB) fixed the problem; the node is running fine now (see disk space metrics).

The PR is ready for review.

@wacban (Contributor) left a comment:

LGTM

@@ -1084,6 +1098,7 @@ pub struct ChainStoreUpdate<'a> {
add_state_sync_infos: Vec<StateSyncInfo>,
remove_state_sync_infos: Vec<CryptoHash>,
challenged_blocks: HashSet<CryptoHash>,
chunk_apply_stats: HashMap<(CryptoHash, ShardId), ChunkApplyStats>,
Contributor:

small suggestion: shard uid is the better unique identifier of a shard. That said, it's often not readily available; in that case don't worry about it.

@jancionear (Contributor Author):

AFAIU, from now on the plan is to add new shard ids instead of increasing UId versions, so ShardId should be unique enough. It is also more user friendly, so I went with that.


@@ -115,6 +116,7 @@ pub struct ApplyChunkResult {
pub bandwidth_scheduler_state_hash: CryptoHash,
/// Contracts accessed and deployed while applying the chunk.
pub contract_updates: ContractUpdates,
pub stats: ChunkApplyStatsV0,
Contributor:

Why the versioned struct instead of the enum?

nit: please add a comment

Comment on lines +30 to +31
/// Was the chunk applied as a missing chunk (apply_old_chunk)
pub is_chunk_missing: bool,
Contributor:

Perhaps use the same schema as chunks do: have height_included as a field and a method to check whether a chunk is new or old. Not a biggie.

Comment on lines +88 to +89
/// Number of previous bandwidth requests (prev_bandwidth_requests.len()).
pub prev_bandwidth_requests_num: u64,
Contributor:

Given that this can be derived, can you remove it and add a method instead?
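The suggestion above, sketched with assumed types; the `Vec<u64>` stand-in for a bandwidth request and the field layout are illustrative, not the real nearcore definitions:

```rust
use std::collections::BTreeMap;

// Stand-in for the real bandwidth request type, which is richer.
type BandwidthRequest = Vec<u64>;

pub struct BandwidthSchedulerStats {
    // shard id -> that shard's previous bandwidth request
    pub prev_bandwidth_requests: BTreeMap<u64, BandwidthRequest>,
}

impl BandwidthSchedulerStats {
    // Derived from the map on demand, instead of storing a redundant
    // prev_bandwidth_requests_num field that could drift out of sync.
    pub fn prev_bandwidth_requests_num(&self) -> u64 {
        self.prev_bandwidth_requests.len() as u64
    }
}

fn main() {
    let mut stats = BandwidthSchedulerStats {
        prev_bandwidth_requests: BTreeMap::new(),
    };
    assert_eq!(stats.prev_bandwidth_requests_num(), 0);
    stats.prev_bandwidth_requests.insert(0, vec![1, 2, 3]);
    stats.prev_bandwidth_requests.insert(2, vec![4]);
    assert_eq!(stats.prev_bandwidth_requests_num(), 2);
    println!("{}", stats.prev_bandwidth_requests_num());
}
```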

Comment on lines +208 to +209
inner.record_outgoing_buffer_stats();
*stats = inner.stats;
Contributor:

nit: Maybe just return the stats from inner and set directly?

@@ -510,6 +533,7 @@ impl ReceiptSinkV2 {
trie: &dyn TrieAccess,
shard_layout: &ShardLayout,
side_effects: bool,
stats: &mut ChunkApplyStatsV0,
Contributor:

nit: Maybe the ChunkApplyStats (without version)?

Comment on lines +651 to +652
for shard in self.outgoing_buffers.shards() {
let buffer = self.outgoing_buffers.to_shard(shard);
Contributor:

sanity check: does this add to the state witness size at all?

}

match self.outgoing_metadatas.get_metadata_for_shard(&shard) {
Some(metadata) if metadata.total_receipts_num() == buffer.len() => {
Contributor:

What does this if do and why do you need it? Please add a comment.

processing_state.stats.transactions_num =
transactions.transactions.len().try_into().unwrap();
processing_state.stats.incoming_receipts_num = incoming_receipts.len().try_into().unwrap();
processing_state.stats.is_chunk_missing = !apply_state.is_new_chunk;
Contributor:

nit: rename is_chunk_missing to is_new_chunk. It's better for consistency, and it's generally good practice to use positive expressions in variable names.
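A minimal sketch combining this rename with the earlier height_included suggestion: store the inclusion height and derive a positively named predicate. The equality check used here mirrors how chunks distinguish new from old, but is an assumption for this sketch:

```rust
// Simplified stand-in for the stats struct; real fields differ.
pub struct ChunkApplyStatsV0 {
    pub height: u64,
    // Height at which the chunk was included; assumed here to equal
    // `height` exactly when the chunk is new rather than missing.
    pub height_included: u64,
}

impl ChunkApplyStatsV0 {
    // Positive predicate instead of a stored `is_chunk_missing` flag.
    pub fn is_new_chunk(&self) -> bool {
        self.height == self.height_included
    }
}

fn main() {
    let new_chunk = ChunkApplyStatsV0 { height: 100, height_included: 100 };
    let old_chunk = ChunkApplyStatsV0 { height: 101, height_included: 100 };
    assert!(new_chunk.is_new_chunk());
    assert!(!old_chunk.is_new_chunk());
    println!("ok");
}
```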
